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An adaptive mesh refinement (AMR) scheme is implemented in a distributed environment using 
Message Passing Interface (MPI) to find solutions to the nonlinear sigma model. Previous work 
studied behavior similar to black hole critical phenomena at the threshold for singularity formation 
in this fiat space model. This work is a follow-up describing extensions to distribute the grid 
hierarchy and presenting tests showing the correctness of the model. 
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The problem of modeling collisions of two black holes 
in general situations poses a number of very difficult 
problems. One among these is finding enough compu- 
tational resources to adequately resolve the physics in a 
reasonable amount of time. Adaptive mesh refinement 
(AMR), whereby fine grids are added where and when 
needed, has proved successful at enabling a tremendous 
dynamic range in resolution. However, even with the ef- 
ficiencies provided by AMR, single machines often fail to 
provide adequate memory or speed. A logical next step 
is then to distribute the AMR problem across a cluster 
of processors. 

Perhaps the most well known use of AMR in relativity 
is Choptuik's implementation of gravitational collapse in 
spherical symmetry . Studying the threshold of black 
hole formation required a large range in resolution to re- 
solve the unique behavior he discovered in so called black 
hole critical phenomena. Other efforts with AMR include 
modeling a single black hole 0, 0i perturbations of a 
black hole H, binary black hole initial data yl, gravita- 
tional waves la , various cosmological models Q , a scalar 
field using characteristic coordinates fixed mesh re- 
finement 131 , orbiting black holes |lOl |. and axisymmetric 
gravitational collapse 0, ^| . 

I describe an implementation of distributed AMR in 
a model problem much simpler than gravitational col- 
lapse, but which nevertheless poses an interesting com- 
putational problem. In particular, I find solutions to the 
nonlinear sigma model using a parallel implementation 
of AMR as a step towards extending the same computa- 
tional infrastructure to solve the gravitational field equa- 
tions. 

Previously [T^ , solutions to this model were found us- 
ing serial AMR (a single machine) which showed a type 
of threshold behavior similar to black hole critical behav- 
ior Families of initial data parameterized by some 
parameter p demonstrated two types of behavior. For 
small p, energy dispersed to infinity (the so-called sub- 
critical regime). For large p, evolutions suggest that a 
singularity forms at finite time {super- critical). Tuning 
between these two states (dispersal and singularity for- 
mation), one finds the critical regime p k p* m. which 
solutions approach a unique solution independent of the 
family with which one begins. This solution is the critical 
solution and demonstrates self-similarity. 

This paper serves as a follow-up to ^| . Presented first 



are the model equations. I then discuss details about the 
implementation of the model and the distributed AMR 
algorithm. Tests of the numerical code follow along with 
a discussion of the parallel performance of the code. 

The model: The nonlinear sigma model provides an 
interesting toy model with which to explore distributed 
AMR. Choosing a generalized hedgehog ansatz 0, the 
dynamics of the model reduce to a single equation of 
motion for a scalar field xi^i Ui -2, t) 



X — X,xx + X,yy ^" X,zz 



sin2x 



(1) 



where commas indicate partial differentiation with re- 
spect to subscripted coordinates, an overdot denotes 

The equation of mo- 



d/dt, and r = y'. 

tion ^ implies the regularity condition x(0,0, 0,t) ~ 
which is enforced by the evolution procedure. The energy 
density associated with this system is given by 
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{Xf + {x..f + {x,y)' + {x,zf 



sin^X 



(2) 



while the angular momentum densities are given by |14| 



(3) 



where the z-component of the angular momentum, for 
example, is 



d^x = d?xx {yx,x - xx,y) ■ (4) 



Initial data for x at t = is set to a generalized Gaussian 
pulse, and then evolved using the equation of motion 
with second-order finite differences in an iterative Crank- 
Nicholson scheme. 

Implementation: Certain solutions of this model re- 
quire a tremendous dynamic range achievable only with 
AMR. The method implemented here follows that of 
Berger and Oliger though with certain simplifica- 
tions. The general strategy is that during the course of 
an evolution, the dynamics arc monitored such that when 
more resolution is needed, finer sub-grids are created in 
the regions which demand it. Concurrently, when the dy- 
namics no longer dictate the existence of fine grids, they 
are removed. The description which follows is intended 
only to present the choices and simplifications specific to 
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this implementation and therefore assumes some famil- 
iarity with the general algorithm as described in [T5j |. 

Consider a grid with uniform grid spacing h = Ax = 
Ay = Az. Some refinement criterion is chosen to deter- 
mine which points on the grid require increased resolu- 
tion, a process called flagging. In this model, points for 
which the energy density obeys h^p > e, for some thresh- 
old value e, are flagged. This condition compares the size 
of the squared derivatives, p, to the resolution squared, 
l/h"^. Other refinement criteria arc widely used including 
estimates of truncation error, but this choice works well 
for this model. 

Once points are flagged, bounding boxes are estab- 
lished by clustering those points into rectangular regions 
where sub-grids of finer resolution will be created. One 
simplification assumed in this code is that grids are com- 
pletely nested within their parent which means that grids 
have unique parents. This restricts the types of cluster- 
ing allowable, and here a very simple method is used. 
A bounding box is found for each disconnected set of 
flagged points. In addition, a buffer region around all 
flagged points is included. Points within two grid points 
of a flagged point are themselves flagged. This buffer- 
ing helps ensure that moving features needing refinement 
stay within refined regions at least until the next refine- 
ment occurs. 

Each grid's resolution is an integral multiple of its par- 
ent's, called the refinement factor. This implementation 
requires this factor to be even. Newly created sub-grids 
are first initialized by linear interpolation in space from 
their parent. Then information from other grids at the 
same resolution (siblings), including those destined for 
removal, is used, where available, to initialize the grid. 
Around the boundaries of regions initialized from sib- 
lings, points are linear interpolated in space to smooth 
out irregularities between the data from the parent and 
from its siblings. 

Advancement in time of a sub-grid follows that of 
coarser grids. In particular, once its parent takes a time 
step, boundary values for advanced time steps of the sub- 
grid are interpolated in time and space from the parent. 
The sub-grid then takes a number of steps equal to the re- 
finement factor until it is time aligned with its parent. At 
that point, the parent grid is injected via half-weighted 
restriction from values in the sub-grid. 

The grid hierarchy can be considered as the union of 
levels, where a level consists of all grids at a given reso- 
lution. Thus the coarse grid constitutes Level 0, all its 
immediate children fill out Level 1 , its grandchildren form 
Level 2, and so on. An example of a grid hierarchy and 
how it is stored is shown in Fig. [31 

A number of strategies exist for distributing an AMR 
code. Hoping to achieve a reasonably simple implementa- 
tion of what is inherently complex, the method adopted 
here is to distribute grids in their entirety to different 
processors. An alternative would be to take a piece of 
every grid and place it on the various processors. Here, 
the different processors "own" different grids and are re- 



sponsible for their time stepping and for any output to 
data files. However, all processors maintain their own 
copy of the grid hierarchy (as shown in Fig.jSJ but do not 
store the fields that live on grids owned by others. Such 
a scheme necessitates that various intcr-grid communica- 
tion traverses the network, except in the cases where the 
involved grids are owned by the same processor. 

One advantage to this strategy is that much of the 
code written for a single processor (the serial code) car- 
ries over directly. In fact, much of the inter-grid routines 
carry over with only minimal changes. For example, to 
provide boundary conditions for an advanced time step of 
a sub-grid, boundary values are interpolated in time from 
a parent grid. In the serial version, such values could be 
calculated in place, whereas in the distributed version 
coarse grid values are interpolated in time on the coarse 
grid's processor, and are then sent to the fine grid's pro- 
cessor where the values are interpolated in space. Such a 
scheme minimizes the amount data needed to be trans- 
mitted and also allows the same routines to be used for 
the serial version just without transmitting the data. 

These grids need to be stepped in coordination so that 
when boundary values are needed by a child grid, the 
parent is advanced to the correct time before those values 
are computed. This coordination is provided by having 
a master processor which owns the coarse grid with all 
other processors entering slave mode. In this mode, they 
continually loop responding to commands passed from 
other processors over the network. 

In the serial version, for each coarse grid time step, 
appropriate actions are taken on entire levels. A similar 
approach works for the distributed code. For example, 
say Level 3 has been advanced, and it is now time to up- 
date the boundary values on all grids on Level 4. Then 
the master processor loops over all grids on Level 4 owned 
by other processors and sends a command to those owner 
processors to update the boundaries of the appropriate 
grids. It then, proceeds by updating the boundary val- 
ues of all Level 4 grids it owns. Finally, it enters slave 
mode in which it listens in case it owns grids which must 
provide such boundary values. Allowing each processor 
more autonomy would likely increase scaling performance 
but would add to the complexity. 

Distribution of the computational problem achieves 
two goals. The first is that the computation is sped up 
with respect to running a single processor. The second 
is that the memory available increases with the number 
of nodes. Ideally, both these scale close to linearly with 
the number of nodes. In the case of speed, optimal scal- 
ing can only be achieved if the processors are kept busy, 
else it is possible for no speed increase to occur. This 
problem is called load balancing and merits quite a bit of 
literature with a variety of schemes. 

Here, such concerns are postponed for later study, and 
instead a nearly trivial scheme is implemented. That 
scheme involves maintaining a table of the workload on 
each processor. Every subgrid of the coarse grid is as- 
signed to the least loaded processor. For other subgrids, 
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FIG. 1: Tests of a typical unigrid evolution with num- 
bers of points: (32 + 1)^ (solid, black), (64 + 1)^ (dot, blue), 
(128 + 1)"^ (short dash, green), and (256 + 1)'^ (long dash, yel- 
low). The top frame shows the convergence factor Q(t) as in 
Eq.|21 For low resolution, convergence is poor, but as resolu- 
tion increases, the convergence rapidly improves, approaching 
the expected value 4. The middle frame shows the loss of 
the ^-component of the angular momentum as a function of 
time, where AJz(f) = Jz{t) — Jz{0)- As the resolution in- 
creases, the loss decreases as would be expected for a conver- 
gent code. The bottom frame shows the loss of energy as 
a function of time. As with the angular momentum, the loss 
converges toward zero. The dimensionless ratio of the angular 
momentum to the energy squared is J/E^ = 0.0072. 



they are assigned to the process which owns their parents 
unless there is a processor with no workload. A number 
of inter-grid operations occur between a parent and a 
child, and hence the scheme favors putting children on 
the same processor as their parent. 

The distributed code is sufficiently similar to the serial 
version that only the distributed version is maintained. 
Of course, it can still be run on a single processor, and 
the incumbent overhead is minimal. 

Another benefit to such a code is that it can accom- 
plish domain decomposition in which the computational 
domain can be partitioned into equal chunks and pro- 
cessed on different processors. Decomposition enables 
one to take advantage of distributed resources without 
adaptive mesh refinement, and is carried out be defining 
the zeroth level to consist of multiple grids which cover 
the entire domain. A future goal is to be able adapt upon 
this decomposed coarse grid. 



FIG. 2: Tests of an AMR evolution. The initial data is equiv- 
alent to that shown in Fig. Q The three runs use identical 
adaptive hierarchies modulo a factor of 2 in resolution. The 
top frame shows the convergence factors for the respective 
coarse grids (solid, black), the Level 1 grids (dot, blue), and 
the Level 2 grids (short dash, green) for the times that they 
exist. The middle and bottom frames show the change 
in angular momentum and energy, respectively computed as 
an integral over the entire grid hierarchies. The data cor- 
respond to runs with coarse grids of resolution: (32 -1-1)^ 
(solid, black), (64 -|- 1)^ (dot, blue), and (128-1-1)^ (short 
dash, green). 



Tests: A very strong test of the code was presented in 
the first paper [iJl in which the obtained critical solution 
was compared with results from a ID code These 
solutions span an extraordinary range of resolutions, and 
that they agree is strong evidence that the codes model 
the same system. 

More traditional tests also provide evidence for the fi- 
delity of the evolutions to the proper equations. In par- 
ticular, one can examine the behavior of numerical solu- 
tions as resolution is increased. Representing a numerical 
solution at some resolution hhy Xh, we can define a con- 
vergence factor 



Qit) = 



1X4/1 - X2h\2 
\X2h - Xh\2 



(5) 



in terms of solutions found for three successive resolutions 
h, 2h, and Ah. If the solutions converge to a unique 
solution, this factor will be greater than unity, and for 
second-order convergence one expects a factor of four. 
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In a similar fashion, one expects numerical solutions 
to maintain approximately conserved quantities. Moni- 
toring changes in the computed energy and angular mo- 
mentum, one expects these changes to converge to zero 
as resolution increases. As shown in Fig. ^ for typical 
initial data, the convergence factor approaches four while 
changes in energy and angular momentum about the z- 
axis converge to zero. 

A similar test is presented in Fig. [3 in which the same 
initial data was tested using AMR. Here, with a coarse 
grid of (33 -f 1)"^ points, an evolution allowing for three 
levels of refinement is run which outputs a history of 
sub-grids created. The resolution of the coarse grid is 
doubled and run again with refinements dictated by the 
refinement history recorded in the previous run. In this 
fashion, adaptive runs can be duplicated with the only 
change being a multiple of the base resolution. These 
results indicate convergence and can be compared to the 
results in Fig. ^ 

The code has also shown to give the same results to 
within machine precision when run on a varying number 
of processors. 

Performance: As already mentioned, the efforts 
described here are aimed at achieving better performance 
by utilizing multiple processors, but are not expected to 
achieve optimal performance. With that said, it makes 
sense to examine the performance as a function of the 
number of processors. 

With the design calling for entire grids to be placed on 
processors (as opposed to breaking all grids into smaller 
grids), there are two regimes of expected performance. 
If each Level consists of a single grid, then the grids are 
nested within each other and the grid hierarchy is com- 
pletely vertical. In this regime, generally little speedup is 
expected. The reason is that the finest grid will usually 
require the most work, and hence all other processors will 
generally have to wait for the finest grid to complete be- 
fore proceeding. Because the other grids take much less 
work than the finest grid, if one were to run such an evo- 
lution on a single processor, it would take about as long 
as on many. 

The other regime is where there are many grids at the 
same resolution with about the same number of points. 
For this case, the grid hierarchy is spread horizontally. 
For such a system, the processors can truly act in parallel, 
and a significant speedup is expected. 

As a first test, such a scenario (see Fig. I^J is imple- 
mented by forcing the code to create some number of 
equal sized grids as children of the coarsest grid. We can 
define a speedup factor S in terms of the time r„ it takes 
for the code to complete on n dual processor nodes as 



5 = ^. 



(6) 



T\ is the time taken on one dual processor node (^t.g. two 
processors). 

These speedup factors are shown in Fig. 01 for two 
variations. The first is that these subgrids remain fixed 





FIG. 3: 2D slice of initial data with 125 sub-grids. The 
data shows five equal-sized sub-grids in each dimension. The 
situation for two, three, and four sub-grids per side (8, 27, 
64 total sub-grids) is similar. For so called "static" runs, no 
re-refinement was done, and for "moving" the sub-grids were 
successively moved one grid point with each refinement. 



while the second has the subgrids moved a single grid 
point every few time steps. The latter variation is to 
simulate grids which move and their incumbent network 
overhead. The figure shows that the execution time is 
sped up but the improvement generally saturates begin- 
ning when the number of processors exceeds half of the 
total number of grids. In other words, good performance 
is predicated on having many grids per processor. 

A less contrived test would simply compare run times 
on various numbers of processors. However, for this 
model, the essential dynamics consist of central collapse, 
and hence a typical, well resolved evolution would contain 
a number of grids largely nested within each other in the 
so-called vertical regime. As such, significant speedup is 
not expected, but there is benefit in that distributing the 
problem provides gains in memory storage. 

Conclusion: These results indicate the validity of 
the implementation of the nonlinear sigma model, and in 
particular support the assertion that, within the hedge- 
hog ansatz but away from spherical symmetry, the criti- 
cal solution between dispersion and singularity formation 
is the same as that found under the assumption of spher- 
ical symmetry jl6| . 

This work serves as an initial step toward modeling 
general gravitational collapse with distributed adaptive 
mesh refinement. Much work remains, including the im- 
plementation of more sophisticated clustering and load 
balancing. 
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FIG. 4: Scaling performance on an Itanium cluster 
(titan.ncsa.uiuc.edu at NCSA). Pairs of data are shown, 
one for so-called "static" and one for "moving" runs of a cer- 
tain number of sub-grids. In general, the better performing 
of the pair (the one on top) is the "static." The various cases 
consist of: plus signs (-I-) for the 8 sub-grid case; crosses (x) 
for the 27 sub-grid case; asterisks (*) for the 64 sub-grid case; 
and boxes (□) for the 125 sub-grid case. The dashed line in- 
dicates ideal, linear scale-up. The figure indicates significant 
speed-up as long as an average of at least two grids exist per 
process (in other words, the number of nodes is less than a 
quarter the number of sub-grids). 
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FIG. 5: Diagram detailing the structure of the stored grid hierarchy. Shown is a simple example of a three-level tree with grids 
represented as rectangles. Each grid has a unique grid number and also stores its parent, its next sibling, and a single child. In 
instances in which a sibling/child/parent does not exist, a null pointer is used. The hierarchy also stores the process id which 
"owns" the grid (and therefore stores the associated fields defined on that grid). Levels, which consist of all (sub-)grids at a 
given resolution are represented in the diagram as ellipses. Levels store only the beginning grid number in the hierarchy, with 
subsequent grids obtained by traversing the sibling pointers until a null pointer is reached. 



